{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning at Scale, Part I"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## KMeans clustering at scale"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training models with data that fits in memory is very limiting. But minibatch learners can easily work with data directly from disk. \n",
    "\n",
    "We'll use the MNIST data set, which has 8 million images (about 17 GB). The dataset has been partition into groups of 100k images (using the unix split command) and saved in compressed lz4 files. This dataset is very large and doesnt get loaded by default by <code>getdata.sh</code>. You have to load it explicitly by calling <code>getmnist.sh</code> from the scripts directory. The script automatically splits the data into files that are small enough to be loaded into memory. \n",
    "\n",
    "Let's load BIDMat/BIDMach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import $exec.^.lib.bidmach_notebook_init"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And define the root directory for this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val mdir = \"../data/MNIST8M/parts/\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Constrained Clustering. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this tutorial, we are going to evaluate the quality of clustering by using it for classification. We use a labeled dataset, and compute clusters of training samples using k-Means. Then we match new test samples to the clusters and find the best match. The label assigned to the new sample is the majority vote of the cluster. \n",
    "\n",
    "This method by itself doesnt work well. Clusters will often straddle label boundaries leading to poor labelings. Its better to force each cluster to have a single label. We do that by adding the labels in as very strong features before clustering. The label features cause samples with different labels to be very far apart. Far enough that k-Means will never assign them to the same cluster. The data we want looks like this:\n",
    "\n",
    "<pre>\n",
    "           Instance 0      Instance 1      Instance 2    ...\n",
    "           has label \"2\"   has label \"7\"   has label \"0\" ...\n",
    "           /    0               0           10000         ...\n",
    "          |     0               0               0         ...\n",
    "          | 10000               0               0         ...\n",
    "          |     0               0               0         ...\n",
    "label    /      0               0               0         ...\n",
    "features \\      0               0               0         ...\n",
    "(10)      |     0               0               0         ...\n",
    "          |     0           10000               0         ...\n",
    "          |     0               0               0         ...\n",
    "           \\    0               0               0         ...\n",
    "\n",
    "           /  128              19               5         ...\n",
    "          |    47              28               9         ...\n",
    "image    /     42             111              18         ...\n",
    "features \\     37             128              17         ...\n",
    "(784)     |    18             176              14         ...\n",
    "          |    ..              ..              ..\n",
    "\n",
    "</pre>\n",
    "\n",
    "\n",
    "We chose the label feature weights (here 10000) to force the distance between differently-labeled samples (2 * 10000^2) to be larger than the distance between two image samples (1000 * 256^2). This guarantees that points will not be assigned to a cluster containing a different label (assuming there is initially at least one cluster center with each label).  \n",
    "\n",
    "Even though these label features are present in cluster centroids after training, they dont affect matching at test time. Test images dont have the label features, and will match the closest cluster based only on image features. That cluster will have a unique label, which we then assign to the test point. \n",
    "\n",
    "The files containind data in this form are named \"alls00.fmat.lz4\", \"alls01.fmat.lz4\" etc. Since they contain both data and labels, we dont need to load label files separately. We can create a learner using a pattern for accessing these files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val (mm, opts) = KMeans.learner(mdir+\"alls%02d.fmat.lz4\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The string \"%02d\" is a C/Scala format string that expands into a two-digit ASCII number to help with the enumeration.\n",
    "\n",
    "There are several new options that can tailor a files datasource, but we'll mostly use the defaults. One thing we will do is define the last file to use for training (number 70). This leaves us with some held-out files to use for testing. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opts.dim = 300\n",
    "opts.nend = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the training data include image data and labels (0-9). K-Means is an unsupervised algorithm and if we used image data only KMeans will often build clusters containing different digit images. To produce cleaner clusters, and to facilitate classification later on, the <code>alls</code> data includes both labels in the first 10 rows, and image data in the remaining rows. The label features are scaled by a large constant factor. That means that images of different digits will be far apart in feature space. It effectively prevents different digits occuring in the same cluster. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tuning Options"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following options are the important ones for tuning. For KMeans, batchSize has no effect on accracy since the algorithm uses all the data instances to perform an update. So you're free to tune it for best speed. Generally larger is better, as long as you dont use too much GPU ram. \n",
    "\n",
    "npasses is the number of passes over the dataset. Larger is typically better, but the model may overfit at some point. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opts.batchSize = 20000\n",
    "opts.npasses = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You invoke the learner the same way as before. You can change the options above after each run to optimize performance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mm.train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets extract the model as a Floating-point matrix. We included the category features for clustering to make sure that each cluster is a subset of images for one digit. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val modelmat = FMat(mm.modelmat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we build a 30 x 10 array of images to view the first 300 cluster centers as images."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val nx = 30\n",
    "val ny = 10\n",
    "val im = zeros(28,28)\n",
    "val allim = zeros(28*nx,28*ny)\n",
    "for (i<-0 until nx) {\n",
    "    for (j<-0 until ny) {\n",
    "        val slice = modelmat(i+nx*j,10->794)\n",
    "        im(?) = slice(?)\n",
    "        allim((28*i)->(28*(i+1)), (28*j)->(28*(j+1))) = im\n",
    "    }\n",
    "}\n",
    "show(allim kron ones(2,2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll predict using the closest cluster (or 1-NN if you like). Since we did constrained clustering, our data include the labels for each instance, but unlabeled test data doesnt have this. So we project the model matrix down to remove its first 10 features. Before doing this though we find the strongest label for each cluster so later on we can map from cluster id to label. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val igood = find(sum(modelmat,2) > 100)                // find non-empty clusters\n",
    "val mmat = modelmat(igood,?)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "val (dmy, catmap) = maxi2(mmat(?,0->10).t)                // Lookup the label for each cluster\n",
    "mm.model.modelmats(0) = mmat(?,10->mmat.ncols)            // Remove the label features\n",
    "mm.model.modelmats(1) = mm.modelmats(1)(igood,0)\n",
    "catmap(0->100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we define a predictor from the just-computed model and the testdata, with the preds files to catch the predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val (pp, popts) = KMeans.predictor(mm.model, mdir+\"data%02d.fmat.lz4\", mdir+\"preds%02d.imat.lz4\")\n",
    "\n",
    "popts.nstart = 70                                      // start with file 70 as test data\n",
    "popts.nend = 80                                        // finish at file 79\n",
    "popts.ofcols = 100000                                  // Match number of samples per file to test file\n",
    "popts.batchSize = 10000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets run the predictor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pp.predict "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The <code>preds</code> files now contains the numbers of the best-matching cluster centers. We still need to look up the category label for each one, and compare with the reference data. We'll do this one file at a time, so that our evaluation can scale to arbitrary problem sizes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val totals = (popts.nstart until popts.nend).map(i => {\n",
    "                    val preds = loadIMat(mdir + \"preds%02d.imat.lz4\" format i);    // predicted centroids\n",
    "                    val cats = loadIMat(mdir + \"cat%02d.imat.lz4\" format i);       // reference labels\n",
    "                    val cpreds = catmap(preds);                                    // map centroid to label\n",
    "                    accum(cats.t \\ cpreds.t, 1.0, 10, 10)                          // form a confusion matrix\n",
    "}).reduce(_+_)\n",
    "\n",
    "totals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the actual and predicted categories, we can compute a confusion matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val conf = float(totals / sum(totals))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets create an image by multiplying each confusion matrix cell by a white square:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "show((conf * 250f) ⊗ ones(32,32))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Its useful to isolate the correct classification rate by digit, which is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val dacc = getdiag(conf).t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take the mean of the diagonal accuracies to get an overall accuracy for this model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean(dacc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the experiment again with a larger number of clusters (3000, then 30000). You should reduce the batchSize option to 20000 to avoid memory problems.\n",
    "\n",
    "Include the training time output by the call to <code>nn.train</code> but not the evaluation time. Rerun and fill out the table below: \n",
    "\n",
    "<table>\n",
    "<tr>\n",
    "<th>KMeans Clusters</th>\n",
    "<th>Training time</th>\n",
    "<th>Avg. gflops</th>\n",
    "<th>Accuracy</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>300</td>\n",
    "<td>...</td>\n",
    "<td>...</td>\n",
    "<td>...</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>3000</td>\n",
    "<td>...</td>\n",
    "<td>...</td>\n",
    "<td>...</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>30000</td>\n",
    "<td>...</td>\n",
    "<td>...</td>\n",
    "<td>...</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".scala",
   "mimetype": "text/x-scala",
   "name": "scala211",
   "nbconvert_exporter": "script",
   "pygments_lexer": "scala",
   "version": "2.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
